Abstract: On Intel Sandy Bridge processors, the last level
cache (LLC) is divided into cache slices, and all physical
addresses are distributed across the cache slices using
a hash function. Because this hash function is undocumented,
it appears impossible to implement cache partitioning based
on page coloring. This article cracks the hash functions
on two types of Intel Sandy Bridge processors by converting
the problem of cracking the hash function into the problem
of classifying data blocks into different groups, based on
the eviction relationship that exists between data blocks
mapped to the same cache set. Based on the cracking
result, this article shows that it is possible to implement
cache partitioning based on page coloring on a cache indexed
by hashing.


I. Introduction 

Caches play an important role in bridging the gap
between the speed of the processor and that of main memory.
Many cache architectures have been proposed. Hashing
is an important technique for improving cache performance,
as in the hash-rehash cache. In Sandy Bridge, the generation
succeeding Nehalem, one of the new features is that the LLC
is divided into several slices which are connected by a
ring bus, as shown in Figure 1. The location of a given
data block in the LLC is decided by an undocumented hash
function.

This article proposes a novel method that uses HMTT
to crack the hash function, and further verifies the
correctness of the cracking result based on the phenomenon
that when the number of accessed data blocks mapped to the
same cache set exceeds the associativity of the cache set,
the average access latency increases sharply.
In contrast to the statement that page coloring does not
work on caches that are indexed using hashing, this
article ports User Level Cache Control (ULCC), a software
runtime library that improves cache performance
by implementing cache partitioning based on page coloring,
and shows that it is possible to implement cache partitioning
using page coloring on Intel Sandy Bridge processors.

This article makes the following contributions: (1) verifying
that a bit substring of the physical address is used to
select cache sets and that it is possible to partition cache
capacity based on the set index; (2) cracking the hash function
on the Sandy Bridge 4-core processor and further reducing the
hash function to a simple formula; (3) cracking the
hash function on the Sandy Bridge 6-core processor and
presenting the hash function in the form of 32 mapping
tables. As different set indexes correspond to different
mapping relationships, additional attention is needed
when designing the page coloring mechanism. However,
because a bit substring of the physical address
is used to select cache sets, it is still possible
to implement cache partitioning based on the cache set index,
even though the hash function has not been reduced to a
simple formula.




Fig. 1: LLC organization on the Intel Sandy Bridge processor.

This article is organized as follows: Section 2 defines 
the problem. Section 3 describes the procedure to crack 
cache hash function. Section 4 presents the observations 
based on which this article comes up with the assumption 
about the implementation of the hash function. Section 
5 describes the assumptions about the implementations 
of the hash function. Section 6 presents the details of the 
cracking scheme. Section 7 presents the cracking results. 
Section 8 describes the details to verify the correctness 
of the cracking result. Section 9 describes the results of 
performance of cache partition implemented based on 
page coloring. Section 10 summarizes the contributions. 

II. Problem 

All the physical memory is divided into data blocks,
and the size of a data block is the same as the size of
a cache line.

(a) Cache lines in one cache set are located at different positions on the physical cache slice.
(b) Consecutive cache sets are located at different positions on the physical cache slice.

Fig. 2: Different mapping mechanisms between cache sets and the physical cache.


A mapping function exists between data blocks and
cache sets. For an unknown processor P, let C_cache be
the capacity of the cache, let C_memory be the capacity
of the installed memory, let C_cacheline be the size of
a cache line, and let associativity be the associativity of
the cache. The total number of data blocks is

$$ n_{block} = \frac{C_{memory}}{C_{cacheline}} \tag{1} $$

and the number of cache sets is

$$ n_{set} = \frac{C_{cache}}{associativity \times C_{cacheline}} \tag{2} $$


On the Sandy Bridge processor, the LLC is also set-associative.
What is different is that a hash function distributes
all data blocks across the cache slices. Let location
be the location in the cache where a given physical address
pa is stored; as the LLC is divided into slices, location
consists of two parts, slice_id and set_index. slice_id is
the LLC slice the data block is mapped to, and set_index is
the index of the set the physical address is mapped to.

Using these notations, this article defines the hash
function on the Sandy Bridge processor as follows: given
a physical address pa, slice_id and set_index, this
article defines two functions ma_to_slice_id() and
ma_to_set_index() to describe the relationship

slice_id = ma_to_slice_id(pa) (3)

set_index = ma_to_set_index(pa) (4)

III. Procedure 

Figure 3 presents the procedure to crack the cache
hash function on the Intel Sandy Bridge processor. Based on
observations in prior work and on some experiments,
this article first presents a hypothesis and then
shows the hypothesis to be correct.

Fig. 3: The procedure to crack the hash function.


This article proposes a logical model to describe the
problem clearly. The physical cache has a complex
organization, and the method discussed in this article
cannot distinguish between the two situations presented
in Figure 2. The logical model of cache organization is
presented in Figure 5. In this model, each slice is divided
into cache sets; the number of cache sets on each cache
slice is determined by the capacity of the slice, the
size of a cache line, and the associativity of a cache set.
Each cache slice consists of the same number of cache sets.
One exact cache set is selected by specifying slice_id
and set_index. Without specifying slice_id, the cache sets
with the same set_index on every cache slice are selected.

(a) Stride varies from 64B to 64KB.
(b) Stride varies from 128KB to 1024KB.
(c) Stride varies from 2MB to 16MB.
(d) Stride varies from 32MB to 256MB.

Fig. 4: Data-dependent access with different strides. For each stride, there is a point at which the average
access latency begins to increase sharply.

Fig. 5: Logical model of the Intel Sandy Bridge LLC
organization.


Table 1. The number of cache lines that reside in cache when
accessing with different strides and array sizes.

stride   number       stride   number
64B      15 × 2^14    32KB     15 × 2^5
128B     15 × 2^13    64KB     15 × 2^4
256B     15 × 2^12    128KB    15 × 2^3
512B     15 × 2^11    256KB    15 × 2^3
1KB      15 × 2^10    512KB    15 × 2^3
2KB      15 × 2^9     1MB      15 × 2^3
4KB      15 × 2^8     2MB      15 × 2^3
8KB      15 × 2^7     4MB      15 × 2^3
16KB     15 × 2^6     8MB      15 × 2^3


IV. Observation 

A. Main memory and LLC access latency 

When the data accessed can’t be held in cache, cache 
miss will cause average access latency to increase. 

B. Substring of physical address serves as set index 

As described in Section VI (Methodology), the test program
is able to access data blocks in specified cache sets. This
article first accesses the physical memory with different
strides, and records the size of the physical memory and
the stride. When the number of accessed data
blocks exceeds the capacity of the selected cache sets,
serious conflicts cause the average access latency to
increase sharply.

As presented in Figure 4, the average access latency
increases sharply at one point. This point corresponds to
one configuration of the test, given by array size and stride.
For each configuration, the number of data blocks
accessed is determined by the array size and the stride.
As shown in Table 1, when the stride is 128KB or larger, the
number of cache lines the LLC can hold stays at
15 × 2^3 = 120. This suggests that some bits of the physical
address serve as the set index during cache access: when
the stride is large enough, the set index remains unchanged
for all the data blocks accessed. This is consistent with
the Sandy Bridge processor in this test, whose LLC
associativity is 20 and which has 6 slices, so that one set
index covers 6 × 20 = 120 cache lines.
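The arithmetic behind this observation can be sketched in a few lines of C. This is only an illustration under stated assumptions: 64-byte lines, 2048 sets per slice, 6 slices and 20 ways (the 6-core parameters in Table 2), power-of-two strides, and hypothetical function names that do not come from the original test program.

```c
#include <stdint.h>

/* Assumed LLC parameters of the 6-core test machine (see Table 2). */
enum { LINE_SIZE = 64, SETS_PER_SLICE = 2048, SLICES = 6, WAYS = 20 };

/* Number of distinct set indexes touched by a strided walk.
 * Assumes stride is a power of two and at least LINE_SIZE. */
uint64_t distinct_set_indexes(uint64_t stride)
{
    /* The set index repeats every LINE_SIZE * SETS_PER_SLICE bytes (128KB). */
    uint64_t span = (uint64_t)LINE_SIZE * SETS_PER_SLICE;
    return stride >= span ? 1 : span / stride;
}

/* Cache lines that can stay resident: every touched set index offers
 * WAYS lines on each of the SLICES slices. */
uint64_t resident_lines(uint64_t stride)
{
    return distinct_set_indexes(stride) * SLICES * WAYS;
}
```

Under these assumptions, resident_lines(64) gives 15 × 2^14 and every stride of 128KB or more gives 120, which agrees with the values listed in Table 1.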

C. The hash function meets some properties 

As presented in prior work, the mapping used by a
set-associative cache should meet the following properties
to provide better performance: (1) equitability; (2) local
dispersion; (3) simple hardware implementation.

Table 2. Processor parameters.

CPU type          Intel Xeon Processor E5-2640                   Intel Xeon Processor E5-2603
CPU cores         6 cores @ 2.5 GHz                              4 cores @ 2.5 GHz
L1 I-Cache        32kB/core, 8-way, 64B line                     32kB/core, 8-way, 64B line
L1 D-Cache        32kB/core, 8-way, 64B line                     32kB/core, 8-way, 64B line
L2 Cache          256kB/core, 8-way, 64B line                    256kB/core, 8-way, 64B line
L3 Cache          15360kB (shared, 6 slices); 20-way, 64B line   10240kB (shared, 4 slices); 20-way, 64B line
Memory capacity   64GB                                           16GB


V. Assumption 

Let n_slice be the total number of slices and n_set be the
number of sets on each slice, and let (a_{n-1}, ..., a_0) be
the binary representation of the physical address. As presented
in Figure 6, this article splits the binary representation of
an address pa into bit substrings (A_2, A_1, A_0): A_0 is a c
bit string, the displacement in the line; A_1 is an n_slice
bit string; and A_2 is the string of the most significant bits.
Based on prior work, this article makes
the following assumptions:

1) the value of A_0 is used as the block address;

2) the value of A_1 is used as the set index on a specified
cache slice;

3) A_2 is used to decide the slice_id of a given pa.
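The split can be illustrated with a small C sketch. The field widths are assumptions taken from the tested processors (64-byte lines, hence 6 offset bits, and 2048 sets per slice, hence 11 set-index bits); the type and function names are hypothetical.

```c
#include <stdint.h>

#define OFFSET_BITS 6   /* assumed: 64-byte cache line */
#define SET_BITS    11  /* assumed: 2048 sets per slice */

/* The three fields (A_2, A_1, A_0) of a physical address. */
struct pa_fields {
    uint64_t a0;  /* displacement within the cache line */
    uint64_t a1;  /* set index on a slice */
    uint64_t a2;  /* remaining most significant bits */
};

struct pa_fields split_pa(uint64_t pa)
{
    struct pa_fields f;
    f.a0 = pa & ((1ULL << OFFSET_BITS) - 1);
    f.a1 = (pa >> OFFSET_BITS) & ((1ULL << SET_BITS) - 1);
    f.a2 = pa >> (OFFSET_BITS + SET_BITS);
    return f;
}
```

With these widths, the physical address 0x80000000 (2GB) splits into A_0 = 0, A_1 = 0 and A_2 = 0x4000.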


Fig. 6: Split physical address into different fields (bit N-1 down to bit 0).

VI. Methodology 

This section discusses the following three questions:
(1) the criterion used to classify all the physical
addresses; (2) the mechanism that ensures the correctness
of the criterion; (3) the method to obtain the information
needed to classify the data blocks.

A. Platform 

Table 2 presents the parameters of the processors used
in this article.

B. Classifying criteria 

An evicting relationship exists between data blocks that
are mapped to the same cache set. When the number of
accessed data blocks exceeds the associativity of the cache
set, cache conflicts occur and these data blocks begin to
evict each other, as presented in Figure 7. This article
uses this evicting relationship to classify data blocks into
different groups.


C. Getting the evicting relationship between data blocks

1) Accessing data in desired cache set: In order to
get the evicting relationship, the testing program should
be able to fill data into a specified cache set. We boot the
operating system with 2GB of memory, so that the
OS can only use the lower 2GB. This article
further implements a driver to map the physical
memory ranging from 2GB to 3GB into kernel space. In
this way, the physical address accessed can be calculated
by subtracting the base address from the linear address of
the operating system.

Table 3. Use combinations of bits to generate addresses.

(a) bits a2, a1, a0        (b) bits a18, a17, a16
Bit value   Value          Bit value   Value
000         0x0            000         0x00000
001         0x1            001         0x10000
010         0x2            010         0x20000
011         0x3            011         0x30000
100         0x4            100         0x80000
101         0x5            101         0x90000
110         0x6            110         0xa0000
111         0x7            111         0xb0000


2) Test sequence generation: As presented in Table 3,
the physical addresses are generated by bit combination. In
this way, this article verifies the effect of every bit on
the result of the hash function.
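One way to enumerate such bit combinations is sketched below in C. The function names, the base address parameter and the representation of the chosen bits as an array of positions are illustrative assumptions, not the interface of the original test program.

```c
#include <stddef.h>
#include <stdint.h>

/* Build one address: bit i of `combo` is written to address bit
 * positions[i]; all other bits come from `base`. */
uint64_t addr_from_combo(uint64_t base, const unsigned *positions,
                         size_t nbits, unsigned combo)
{
    uint64_t addr = base;
    for (size_t i = 0; i < nbits; i++) {
        if ((combo >> i) & 1u)
            addr |= 1ULL << positions[i];
    }
    return addr;
}

/* Fill out[0 .. 2^nbits - 1] with every combination of the chosen bits.
 * The caller provides room for 2^nbits entries. Returns the count. */
size_t gen_addresses(uint64_t base, const unsigned *positions,
                     size_t nbits, uint64_t *out)
{
    size_t count = (size_t)1 << nbits;
    for (size_t c = 0; c < count; c++)
        out[c] = addr_from_combo(base, positions, nbits, (unsigned)c);
    return count;
}
```

For example, with positions {0, 1, 2} and a base of 0, gen_addresses produces the eight values 0x0 through 0x7 of Table 3(a).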

3) Array initialization: The testing program first
allocates an array. It then initializes
the test array in a data-dependent manner, which means
that the data stored at the current read address is the
address of the next read. As shown in Figure 8,
the main operation of the test program is to read data
and use the data as the address of the following read
operation.
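A data-dependent initialization of this kind can be sketched as follows. This is a user-space illustration that works on virtual addresses in an ordinary buffer, whereas the article accesses a physical range mapped by a driver; init_chain and chase are hypothetical names, and the buffer is assumed to be suitably aligned with a stride that is a multiple of the pointer size.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Link n slots, spaced `stride` bytes apart inside `buf`, into one cycle
 * that visits them in shuffled order. Each slot stores the address of
 * the slot to read next. */
void init_chain(unsigned char *buf, size_t n, size_t stride, unsigned seed)
{
    if (n == 0)
        return;
    size_t *order = malloc(n * sizeof *order);
    if (order == NULL)
        return;
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i];
        order[i] = order[j];
        order[j] = t;
    }
    for (size_t i = 0; i < n; i++) {
        uintptr_t *slot = (uintptr_t *)(buf + order[i] * stride);
        *slot = (uintptr_t)(buf + order[(i + 1) % n] * stride);
    }
    free(order);
}

/* Perform `steps` dependent reads: each loaded value is the next address. */
uintptr_t chase(uintptr_t addr, size_t steps)
{
    while (steps--)
        addr = *(volatile uintptr_t *)addr;
    return addr;
}
```

Because the slots form a single cycle, chasing n steps from any slot returns to that slot, so every read depends on the previous one and the hardware cannot overlap them.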

4) Ensure the correctness of evicting relationship:
As depicted in Listing 2, an idle loop is inserted
between adjacent memory accesses to make sure that
only two physical addresses with an eviction relationship
between them are temporally adjacent.

Fig. 7: Extracting the evicting relationship between data blocks from the memory reference trace that HMTT collects: (1) the
array is initialized in a data-dependent manner; (2) associativity data blocks are read into cache; (3) the next data
block is going to be read; (4) one cache line is evicted before the next data block can be read into cache; (5) the next
data block is read into cache.


    while (j < iteration) {
        /* dependent read: the loaded value is the next address */
        addr = *(TYPE *)(addr);
        j++;
    }

Listing 1: The main operation of the test program without
an interval between consecutive accesses.


    while (j < iteration) {
        /* dependent read: the loaded value is the next address */
        addr = *(TYPE *)(addr);
        /* modify the cache line so that it becomes dirty */
        *(TYPE *)(addr + 16) = 1;
        /* idle loop between consecutive accesses */
        while (k++ < 1000)
            ;
        j++;
        k = 0;
    }

Listing 2: The main operation of the test program with
an interval between consecutive accesses.


5) Collecting memory reference trace: HMTT is a
hybrid hardware/software memory trace monitoring system.
This tool collects all the memory references
that the memory controller issues to the memory modules.


Although the tool can be programmed to collect various
information, the information needed in this article
includes the physical address, the time interval between
consecutive physical addresses, and the read/write bit of
each physical address, as presented in Table 4. The whole
process is presented in Figure 7. When the cache set
is full and the testing program reads another data block
into cache, one of the data blocks already in the
cache needs to be evicted to make room for the newly
read data block. The two operations are collected by
HMTT and saved as two memory reference records. Each
record contains the address of the
operation and whether the operation is a read or a write.

6) Classify all the physical addresses into different
groups: As presented in Figure 9(b), if an evicting
relationship exists between physical addresses A and B, and
also exists between physical addresses B and C, then it is
concluded that an evicting relationship also exists between
physical addresses A and C. This article further classifies
all the physical addresses into different groups using a
method based on connected subgraphs.
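The grouping step amounts to labeling the connected components of a graph whose nodes are data blocks and whose edges are observed eviction pairs. The following C sketch shows one way to do this with a breadth-first search; it is a minimal illustration with hypothetical names and a plain edge list, not the tool used in the article.

```c
#include <stdlib.h>

/* Label connected components of an undirected graph.
 * Nodes 0 .. n-1 stand for data blocks; edge e joins edge_a[e] and
 * edge_b[e] and stands for one observed eviction pair.
 * group[v] receives the component id of node v.
 * Returns the number of groups, or -1 if memory allocation fails.
 * Runs in O(n * m) time, which is enough for an illustration. */
int classify_blocks(int n, const int *edge_a, const int *edge_b, int m,
                    int *group)
{
    if (n <= 0)
        return 0;
    int *queue = malloc((size_t)n * sizeof *queue);
    if (queue == NULL)
        return -1;
    for (int v = 0; v < n; v++)
        group[v] = -1;

    int ngroups = 0;
    for (int s = 0; s < n; s++) {
        if (group[s] != -1)
            continue;                      /* already classified */
        int head = 0, tail = 0;
        queue[tail++] = s;
        group[s] = ngroups;
        while (head < tail) {              /* breadth-first search */
            int u = queue[head++];
            for (int e = 0; e < m; e++) {  /* scan edges touching u */
                int v;
                if (edge_a[e] == u)
                    v = edge_b[e];
                else if (edge_b[e] == u)
                    v = edge_a[e];
                else
                    continue;
                if (group[v] == -1) {
                    group[v] = ngroups;
                    queue[tail++] = v;
                }
            }
        }
        ngroups++;
    }
    free(queue);
    return ngroups;
}
```

With eviction pairs (0,1), (1,2) and (3,4) over six blocks, blocks 0, 1 and 2 land in one group, blocks 3 and 4 in a second, and block 5 forms a group of its own.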

Table 4. Sample memory reference trace.

Seq   Read or Write   Physical Address   Interval
1     read            bfd60000           15
2     write           be1a0000           1094
3     read            bfd80000           15
4     write           be4a0000           608
5     read            bfda0000           9
6     write           be500000           1206
7     read            bfdc0000           20
8     write           bef40000           1090


Fig. 8: The test array is initialized using shuffled addresses
in a data-dependent manner. Each element holds an 8-byte
address field followed by a make-dirty field.


VII. Results 

This article presents the mapping relationship in the
form of mapping tables. On both the Sandy Bridge
4-core and 6-core processors, bit string A1 of the physical
address selects the cache set directly. For both processors,
there are n_set cache sets per cache slice. Let C_memory be
the installed memory, let C_datablock be the size of a data
block, and let n_block be the total number of data blocks.
The number of data blocks that are mapped to cache sets with
the same set index is

$$ \frac{C_{memory}}{C_{datablock} \times n_{set}} \tag{5} $$

As presented in Figure 10(a), because substring A1 of the
physical address is used to select the cache set, the data
blocks that share the same set index are mapped to the cache
sets with that set index on all cache slices; the total
number of cache slices is n_slice. For the blocks sharing the
same set index, this article uses a mapping table to describe
the relationship.


A. 4 core processor

On the Sandy Bridge 4-core processor, the installed memory
is 16GB, so the total number of data blocks is

$$ n_{block} = \frac{C_{memory}}{C_{datablock}} = \frac{16\,\mathrm{GB}}{64\,\mathrm{B}} = 2^{28} \tag{6} $$

The number of cache sets on each cache slice is 2048, so
the number of data blocks that share the same cache set
index is

$$ \frac{n_{block}}{n_{set}} = \frac{2^{28}}{2^{11}} = 2^{17} \tag{7} $$


These data blocks are mapped to the selected
cache sets. As presented in Table 6, because all the blocks
share the same set index, this article only presents the
A2 string of each block address. Each set index
corresponds to a mapping table. This article finds that on
the Intel 4-core processor, the mapping tables corresponding
to different set indexes are the same.

The installed physical memory is 16GB, so the mapping
table covers 34-bit physical addresses.
Because bit string A1 of the physical address selects the
cache set directly and there are 2048 cache sets on each
cache slice, 2048 mapping tables would be needed
to describe the relationship. However, this article finds
that the mapping table is the same for all set indexes.

Reduction of the Sandy Bridge processor mapping table.
The mapping table can be reduced to a simple formula.
As presented in Figure 10(b) and Table 5, two intermediate
values, bit_a0 and bit_a1, are computed; their four possible
combinations correspond to the four cache slices. The value
of the pair (bit_a1, bit_a0) is used to select
one set from the four cache sets selected by the set index
of the data block.


B. 6 core processor 

On the Sandy Bridge 6-core processor, the installed memory
is 64GB, so the total number of data blocks is

$$ n_{block} = \frac{C_{memory}}{C_{datablock}} = \frac{64\,\mathrm{GB}}{64\,\mathrm{B}} = 2^{30} \tag{8} $$

Because there are 2048 cache sets on each slice, the
number of data blocks sharing the same cache set index
is

$$ \frac{n_{block}}{n_{set}} = \frac{2^{30}}{2^{11}} = 2^{19} \tag{9} $$


However, on the 6-core processor, cache sets with different
set indexes might have different mapping tables.
There are 2048 set indexes, so there could be
up to 2048 mapping tables, one per set index.
This article has tested
every set index with 1GB of physical memory (30-bit
physical addresses). As presented in Table 8, the result
shows that the mapping tables of some set indexes are
the same; for example, set indexes 0, 2, 65 and 67 share the
same mapping table. There are 32 different mapping tables, and
the set indexes from 0 to 2047 fall into these 32 mapping tables.
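The grouping listed in Table 8 for set indexes 0 to 127 follows a regular bit pattern, which the C sketch below reproduces. The pattern is inferred here from the rows of Table 8 and is not a formula stated in the article; the function name is hypothetical, and the sketch deliberately refuses set indexes above 127 because the table does not spell out how the higher set-index bits take part.

```c
/* Mapping-table index (1..32) for a set index in 0..127, following the
 * grouping listed in Table 8. Returns -1 outside that range.
 * Inferred pattern: bit 1 of the set index has no effect, bit 6 flips
 * bit 0, and bits 2..5 select the pair of tables. */
int mapping_table_index(unsigned set_index)
{
    if (set_index > 127)
        return -1;
    unsigned low  = (set_index & 1u) ^ ((set_index >> 6) & 1u);
    unsigned high = (set_index >> 2) & 0xFu;
    return (int)((high << 1) | low) + 1;
}
```

For instance, set indexes 0, 2, 65 and 67 all yield mapping table 1, and set indexes 61, 63, 124 and 126 all yield mapping table 32, as in Table 8.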


Higher physical address bits also affect the result of
the hash function. On the Sandy Bridge 6-core processor, this
article has verified each set index with 1GB of physical
memory. Let (a_{n-1}, ..., a_0) be the binary representation
of a physical address; with 1GB of physical memory tested,
this article obtains the result for bits (a_29, ..., a_0). For
the higher address bits (a_35, ..., a_30), this article
verifies with 64GB (36-bit physical address) of physical
memory. The result shows that the higher bits also affect the
result of the hash function. However, the data blocks sharing
the same set index can still be split into 6 groups. This
means that substring A_2 is used to select the slice_id of
a given pa.

$$
\begin{aligned}
bit\_a_0 ={} & get\_bit(A_2, 0) \oplus get\_bit(A_2, 1) \oplus get\_bit(A_2, 2) \oplus get\_bit(A_2, 3) \oplus get\_bit(A_2, 4) \\
& \oplus get\_bit(A_2, 5) \oplus get\_bit(A_2, 7) \oplus get\_bit(A_2, 9) \oplus get\_bit(A_2, 10) \\
& \oplus \big(get\_bit(A_2, 12) \mathbin{\&} get\_bit(A_2, 14)\big) \oplus \big(get\_bit(A_2, 14) \mathbin{\&} get\_bit(A_2, 13)\big) \\
bit\_a_1 ={} & get\_bit(A_2, 0) \oplus get\_bit(A_2, 2) \oplus get\_bit(A_2, 4) \oplus get\_bit(A_2, 6) \oplus get\_bit(A_2, 8) \\
& \oplus get\_bit(A_2, 10) \oplus get\_bit(A_2, 11) \oplus get\_bit(A_2, 13) \\
& \oplus \big(get\_bit(A_2, 14) \mathbin{\&} get\_bit(A_2, 13) \mathbin{\&} get\_bit(A_2, 12)\big)
\end{aligned}
$$

Table 5. Two intermediate values used to reduce the Sandy Bridge 4 core mapping table.

(a) Breadth first search.
(b) Represent each data block with a node, and represent each evicting relationship with an edge.
(c) The data blocks are classified into different groups.

Fig. 9: This article uses the connected subgraph method to solve the problem of classifying data blocks.

(a) Figure A (b) Figure B

Fig. 10: Intel Sandy Bridge 4 core processor hash function cracking result.

Fig. 11: The physical addresses sharing the same set
index are divided into six groups using Cytoscape.

Table 6. 4 core cache hash mapping table

Slice 0   Slice 1   Slice 2   Slice 3
4000      4001      4002      4003
4007      4006      4005      4004
4009      4008      400b      400a
400e      400f      400c      400d
4013      4012      4011      4010
4014      4015      4016      4017
401a      401b      4018      4019
401d      401c      401f      401e
4021      4020      4023      4022
4026      4027      4024      4025
4028      4029      402a      402b
402f      402e      402d      402c
4032      4033      4030      4031
4035      4034      4037      4036
403b      403a      4039      4038
403c      403d      403e      403f

VIII. Verification of the Correctness of the Cracking Result of Sandy Bridge Cache Hash Function

A. Object of correctness verification

This article proposes a method to verify the correctness
of the cracking result. The cracked function
maps different data blocks to different cache sets. This
article verifies the correctness of the cracking result by
checking that the data blocks that the cracking result
indicates to be in one cache set are truly in one
cache set.




Fig. 13: Replacement policy affects the average latency.


As presented in Figure 13, the average access latency
increases sharply when the number of data blocks accessed
exceeds the associativity of the LLC.
This article also finds that performing a write
operation after the data block is read into cache
affects the cache replacement policy. In Figure 13,
polluting the cache line means adding a write
operation after each read, and not polluting the cache line
means performing only dependent read operations. It can be
seen from the figure that performing the extra write operation
causes the average access latency to increase more slowly
than in the read-only configuration. One
possible explanation is shown in Figure 12: if a newly
accessed data block is inserted into the MRU position when
it arrives, then once the number of
data blocks accessed exceeds the associativity of the LLC,
even by one, the average access latency
increases sharply. The method is able to check every
cache set.

Verify the correctness of the cracking result using
two threads. In this scenario, two threads
perform data-dependent accesses, and the average access
latency is recorded for each configuration. In the first
configuration, the data blocks accessed by thread 1 and
the data blocks accessed by thread 2 are mapped to
different cache sets; the results are presented in Figure 15.
In the second configuration, the data blocks accessed by
thread 1 and the data blocks accessed by thread 2 are
mapped to the same cache set; the results are presented
in Figure 14. For both configurations, the number of data
blocks accessed by thread 2 varies from 1 to 40, and the
number of data blocks accessed by thread 1 varies from
1 to 4. When the number of data blocks mapped to
one cache set exceeds the associativity, the average latency
increases sharply. Consider the case in which thread 1
performs data-dependent access on 1 data block. When the
data blocks accessed by the two threads are mapped to the
same cache set,
the average access latency of thread 2 increases sharply
when the number of data blocks accessed by thread 2 is
18. On the contrary, when the data blocks accessed by the
two threads are mapped to different cache sets, the average
access latency of thread 2 increases sharply when the
number of data blocks accessed by thread 2 is 18. This
article gets similar results when the number of data
blocks accessed by thread 1 varies from 1 to 4.

Write operations affect the cache replacement policy.
In the following test, the program performs data-dependent
accesses in two configurations:
(1) after the data block is read, perform a
write operation to make the cache line dirty; (2) do not
make the cache line dirty.

Table 7. Intel Sandy Bridge 6 core processor hash function cracking result.

(a) 6 core cache hash mapping table, set index 1

Slice 0   Slice 1   Slice 2   Slice 3   Slice 4   Slice 5
4000      4001      4002      4003      4007      400e
400d      4006      4005      4004      400a      400f
4017      400c      4008      4009      400b      4014
401a      4016      4012      4013      4010      4015
401d      401b      401f      401e      4011      4018
4023      401c      4026      4027      4024      4019
4029      4022      402b      402a      4025      4020
402e      4028      402c      4030      4032      4021
4034      402f      4031      4037      4033      402d
4039      4035      4036      403d      403e      403a
4041      4038      403c      4042      403f      403b
4046      4040      4043      4045      404c      4044
404b      4047      404e      404f      404d      4048
4051      404a      4054      4055      4056      4049
405c      4050      4059      4058      4057      4052
4065      405d      405e      405f      405a      4053
4068      4064      4060      4061      405b      4066
406f      4069      406a      406b      4062      4067
4072      4073      406d      406c      4063      4070
4075      4074      4077      4076      406e      4071
407f      407e      407a      407b      4078      407c
NULL      NULL      NULL      NULL      4079      407d

(b) 6 core cache hash mapping table, set index 2

Slice 0   Slice 1   Slice 2   Slice 3   Slice 4   Slice 5
4000      4002      4003      4004      4006      4007
4001      4008      4009      4005      400b      400a
400c      400f      400e      4012      4011      4010
400d      4015      4013      401e      4016      4017
401a      4018      4014      401f      401c      401d
401b      4021      4019      4026      4022      4023
402e      402c      4020      4027      4025      4024
402f      4036      402d      402a      4028      4029
4034      403b      4037      402b      4032      4033
4035      403c      403a      4030      403f      4039
4038      4044      403d      4031      4040      403e
4046      4049      4045      4042      404a      4041
4047      4053      4048      4043      404d      404b
4051      4054      4052      404e      4050      404c
405c      405e      4055      404f      4057      4056
405d      4060      405f      4058      405a      405b
4064      4067      4061      4059      4063      4062
4065      406a      4066      406c      406e      406f
4068      4070      406b      406d      4074      4075
4069      407a      4071      4076      4079      4078
4072      407d      407c      4077      407e      407f
4073      NULL      NULL      NULL      407b      407d

Table 8. Mapping tables corresponding to set indexes ranging from 0 to 127 share 32 different mapping tables. The situation is
the same for set indexes ranging from 128 to 2047.

Mapping table index   Index   Index   Index   Index   Mapping table index   Index   Index   Index   Index
1                     0       2       65      67      17                    32      34      97      99
2                     1       3       64      66      18                    33      35      96      98
3                     4       6       69      71      19                    36      38      101     103
4                     5       7       68      70      20                    37      39      100     102
5                     8       10      73      75      21                    40      42      105     107
6                     9       11      72      74      22                    41      43      104     106
7                     12      14      77      79      23                    44      46      109     111
8                     13      15      76      78      24                    45      47      108     110
9                     16      18      81      83      25                    48      50      113     115
10                    17      19      80      82      26                    49      51      112     114
11                    20      22      85      87      27                    52      54      117     119
12                    21      23      84      86      28                    53      55      116     118
13                    24      26      89      91      29                    56      58      121     123
14                    25      27      88      90      30                    57      59      120     122
15                    28      30      93      95      31                    60      62      125     127
16                    29      31      92      94      32                    61      63      124     126

Fig. 14: The data blocks accessed by two threads are mapped into the same cache set. Panels: (a) thread 1 accesses 1 data block; (b) thread 1 accesses 2 data blocks; (c) thread 1 accesses 3 data blocks; (d) thread 1 accesses 4 data blocks. Horizontal axis: number of data blocks that are accessed.

Fig. 15: The data blocks accessed by two threads are mapped into two different cache sets. Panels: (a) thread 1 accesses 1 data block; (b) thread 1 accesses 2 data blocks; (c) thread 1 accesses 3 data blocks; (d) thread 1 accesses 4 data blocks. Horizontal axis: number of data blocks that are accessed.

Let L_average be the average access latency, L_memory
be the access latency of memory, L_LLC be the access
latency of the LLC, N be the associativity of the LLC, and n be
the number of data blocks that are accessed sequentially
by the testing program. Then the relationship between
the average access latency and the number of data blocks
accessed can be described as:

$$
L_{average} =
\begin{cases}
L_{LLC}, & \text{if } n \le N \\
L_{memory} + L_{LLC}, & \text{if } n > N
\end{cases}
\tag{10}
$$

Memory reference traces collected with HMTT offer a
different perspective on the cache replacement policy. This
article further analyzes the trace: it cuts a segment of the
whole trace and counts how many times each data block
is accessed in this segment (that is, in this period of
execution). For simplicity, each accessed data block is
labeled with an index. As presented in Figure 16, without
polluting the cache line, the access count of each data
block is uniform. However, when an extra write
operation is performed to pollute the cache line, the access
counts of different data blocks become non-uniform, which
means that some of the data blocks are held in cache longer
than the others. This shows
that write operations affect the cache replacement policy.

IX. ULCC 

(a) 18 addresses (b) 19 addresses (c) 20 addresses (d) 21 addresses (e) 22 addresses (f) 23 addresses (g) 24 addresses

Fig. 16: The distribution of access counts over different data blocks in a period of time, when the test program
performs data-dependent access on different numbers of data blocks.

User Level Cache Control (ULCC) is a software
package that implements cache partitioning using page
coloring. It improves the performance of multi-threaded
programs by enforcing a user-demanded cache capacity
allocation. By modifying the macro that extracts the page
color from a physical address, this article ported ULCC to
the Intel Sandy Bridge processor.
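A color-extraction macro of this kind could look like the C sketch below. It is not ULCC's actual macro, whose definition is not shown in this article; PAGE_COLOR and the constants are hypothetical, and the sketch assumes 4KB pages, 64-byte cache lines and 2048 sets per slice, so that the color is the part of the set index lying above the page offset. It colors by set index only and does not model the slice hash.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumed: 4KB pages */
#define LINE_SHIFT 6   /* assumed: 64-byte cache lines */
#define SET_BITS   11  /* assumed: 2048 sets per slice */

/* Set-index bits that lie above the page offset: 6 + 11 - 12 = 5. */
#define COLOR_BITS (LINE_SHIFT + SET_BITS - PAGE_SHIFT)
#define NUM_COLORS (1u << COLOR_BITS)

/* Hypothetical color extraction: pages with the same color compete for
 * the same group of set indexes. */
#define PAGE_COLOR(pa) \
    ((unsigned)(((uint64_t)(pa) >> PAGE_SHIFT) & (NUM_COLORS - 1)))
```

Under these assumptions there are 32 colors, and two pages whose physical addresses differ by a multiple of 128KB receive the same color.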

A. MergeSort 

MergeSort is implemented in multiple threads. During 
the execution of program, the intermediate result of every 
block is highly reused. ULCC improves the performance 
by allocating different cache capacity for data of differ¬ 
ent reuse degree. 

The performance of the program with and without
ULCC support is depicted in Figure 17. When the
size of the sorting block is chosen properly, the execution
time is reduced by 20%. This result was obtained using one
thread to finish the merge sort, so the performance gain
comes only from preventing cache pollution by data within
one thread.

Fig. 17: The execution time of MergeSort implementations
with and without ULCC support.

B. MatMul 

The MatMul program multiplies two double-precision
matrices A and B, and produces the product matrix C.
To achieve the necessary data reuse in the LLC, the matrix
multiplication is carried out block by block. The
block a on the ith block row and jth block column of
matrix A is multiplied with all the blocks on the jth
block row of matrix B, and the results are accumulated
into the blocks on the ith block row of matrix C. The
data in matrix A therefore has a high degree of reuse, and
until the program finishes the computation with block a, it
is desirable to keep the data of a in the cache.
However, without a dedicated space for block a, its data
may be repeatedly evicted from the cache before
its next use every time the program switches blocks in
matrix B and matrix C, even with a rather small block
size. To reduce the chance that the data in each block of
matrix A is evicted from the last level cache prematurely,
the block size must be chosen appropriately.
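The block-by-block loop order described above can be sketched in plain Python. This is only an illustration of the access pattern, not the benchmark code used in the article; the function name `blocked_matmul` is hypothetical, and the sketch assumes square matrices whose dimension is a multiple of the block size `bs`.

```python
def blocked_matmul(A, B, bs):
    """Multiply square matrices A and B block by block.

    For each block a of A (block row bi, block column bj), a is multiplied
    with every block on block row bj of B, and the partial products are
    accumulated into block row bi of C. Assumes len(A) % bs == 0.
    """
    n = len(A)
    nb = n // bs
    C = [[0.0] * n for _ in range(n)]
    for bi in range(nb):
        for bj in range(nb):            # block a = A[bi][bj], reused below
            for bk in range(nb):        # walk block row bj of B
                for i in range(bi * bs, (bi + 1) * bs):
                    for j in range(bj * bs, (bj + 1) * bs):
                        a_ij = A[i][j]
                        for k in range(bk * bs, (bk + 1) * bs):
                            C[i][k] += a_ij * B[j][k]
    return C
```

In this loop order, block a is touched once per block of B's block row, which is why it benefits from cache space that the streaming blocks of B and C cannot evict.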

The performance of the program with and without
ULCC support is depicted in Figure 18. When the data
block size is chosen properly, so that the frequently
used data can be held in the cache, this article
achieves the same performance improvement as reported
in the original ULCC work. This further supports the
correctness of the cracking results.



Fig. 18: The execution time of MatMul implementations 
with and without ULCC support. 

X. General test 

This article also proposes a method to crack the hash
function without the support of HMTT. The core idea is
that when the number of accessed addresses exceeds the
associativity of the cache set, the average access latency
increases sharply.

The main steps of the cracking method without
HMTT support are as follows:

(1) Verify that some bits of the physical address are used directly to select the cache set, and divide all data blocks into groups based on set index.

(2) From the data blocks sharing the same set index, choose associativity data blocks that are mapped to one cache set. When these data blocks are accessed sequentially, the average access latency is close to the access latency of main memory. Put these data blocks in the classified group.

(3) Choose associativity - 1 data blocks from the classified group and one data block from the unclassified group, and perform data-dependent accesses of these data blocks. There are two possible results. If the average access latency is close to the latency of main memory, the data block chosen from the unclassified group is mapped to the same cache set as the other associativity - 1 data blocks. If the average access latency is close to the latency of the LLC, the data block chosen from the unclassified group is not mapped to the same cache set as the other associativity - 1 data blocks.

(4) Repeat step 2 and step 3 until all data blocks sharing the same set index have been labeled as classified or unclassified.

(5) Choose another data block from those labeled as unclassified, and start again from step 1.
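The classification loop can be sketched against a simulated latency oracle. Everything in this sketch is hypothetical: `toy_slice` is a stand-in for the hidden hash (not the real one), `is_slow` replaces an actual latency measurement, and all blocks are assumed to share one set index already. Two details are assumptions that go beyond the steps above: the seed group is found by greedily shrinking a slow pool, and each probe uses `ASSOC` known blocks plus one candidate, so that the total exceeds the associativity as the core idea requires.

```python
ASSOC = 4    # toy associativity (assumption)
SLICES = 4   # toy slice count (assumption)

def toy_slice(block):
    """Stand-in for the hidden hash: XOR of the 2-bit chunks of the block id."""
    h = 0
    while block:
        h ^= block & (SLICES - 1)
        block >>= 2
    return h

def is_slow(blocks):
    """Toy latency oracle: slow when more than ASSOC blocks share one slice."""
    counts = {}
    for b in blocks:
        s = toy_slice(b)
        counts[s] = counts.get(s, 0) + 1
    return any(c > ASSOC for c in counts.values())

def find_seed(pool):
    """Greedily drop blocks while the rest stays slow; ASSOC + 1 conflicting blocks remain."""
    seed = list(pool)
    for b in list(pool):
        trial = [x for x in seed if x != b]
        if is_slow(trial):
            seed = trial
    return seed

def classify(pool):
    """Split blocks into groups that conflict with each other; return (groups, leftover)."""
    pool = list(pool)
    groups = []
    while is_slow(pool):
        seed = find_seed(pool)
        probe = seed[:ASSOC]              # known blocks of one cache set
        group = list(seed)
        for candidate in pool:
            if candidate not in seed and is_slow(probe + [candidate]):
                group.append(candidate)   # candidate conflicts with the probe set
        groups.append(sorted(group))
        pool = [b for b in pool if b not in group]
    return groups, pool

groups, leftover = classify(range(64))
```

Each oracle call stands for a full timed access loop on real hardware, and the number of calls grows roughly quadratically with the number of blocks per set index, which is consistent with the long running time noted for this method.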

The problem with this method is that it takes too long 
to finish the test. 

XI. Conclusions 

On the Intel Sandy Bridge processor, the last level cache
(LLC) is divided into cache slices, and all physical
addresses are distributed across the cache slices using
a hash function. As long as this hash function remains
undocumented, it is impossible to implement cache
partitioning based on page coloring.

This article cracks the hash function on two types
of Intel Sandy Bridge processors. On both the 4-core
and 6-core processors, a bit substring of the physical
address is used to select cache sets. The difference is
that on the 4-core processor the mapping relationship is
the same for all cache set indexes, and the cracked hash
function reduces to a simple formula. On the 6-core
processor, by contrast, different cache set indexes have
different mapping relationships. This article has not
reduced that hash function to a simple formula; instead,
the hash function is presented in the form of mapping
tables.

This article shows that it is possible to implement
cache partitioning based on page coloring. On the 4-core
processor, the cracking result makes this straightforward.
On the 6-core processor, even without reducing the hash
function to a simple formula, cache partitioning can at
least be implemented based on set index, since a bit
substring of the physical address is used to select cache
sets.